metal : optimize multi-sequence FA vec kernel #13493

Merged
ggerganov merged 2 commits into master from gg/metal-fa-vec-mask-opt on May 13, 2025

Conversation

@ggerganov
Member

ggerganov commented May 13, 2025

ref #10860 #13488

This should largely resolve the text-generation performance for multiple sequences with large prompts on Metal. I think this practically achieves the same effect as PagedAttention. This PR implements it for the FA-vec kernel - we simply skip fully-masked KV cache blocks with the size of the simdgroup (32). The other FA kernel for BS >= 4 already has a similar optimizaion, but I think it is not as optimal as it can be. It can now be improved by utilizing #12850 to precompute the skip conditions and pass those to the FA kernel as an extra tensor - this will be done in follow-up PR. (edit: nvm - it's already good enough).

Note that this also fixes the TG performance for SWA models, without having to do the defrag (see #13194 (comment)).

To test this, we can use the llama-batched-bench tool like this. It's important to generate more than one large prompt (via the -npl argument) in order to simulate a server with multiple slots. We observe that the TG speed is improved at large contexts.

make -j && ./bin/llama-batched-bench -m ../models/qwen2.5-7b-coder/ggml-model-q4_k.gguf -c 40000 -b 2048 -ub 512 -npp 0,512,4096,8192 -ntg 32 -npl 2,3 -fa
  • master

main: n_kv_max = 40192, n_batch = 2048, n_ubatch = 512, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16

PP TG B N_KV T_PP s S_PP t/s T_TG s S_TG t/s T s S t/s
0 32 2 64 0.171 0.00 0.555 115.21 0.726 88.11
0 32 3 96 0.021 0.00 0.741 129.53 0.763 125.89
512 32 2 1088 0.856 1196.64 0.599 106.76 1.455 747.67
512 32 3 1632 1.267 1212.00 0.788 121.85 2.055 794.09
4096 32 2 8256 7.339 1116.21 0.659 97.09 7.998 1032.22
4096 32 3 12384 11.070 1110.04 1.063 90.31 12.133 1020.70
8192 32 2 16448 16.144 1014.89 0.767 83.44 16.911 972.64
8192 32 3 24672 24.644 997.26 1.408 68.20 26.051 947.06
  • PR

main: n_kv_max = 40192, n_batch = 2048, n_ubatch = 512, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16

PP TG B N_KV T_PP s S_PP t/s T_TG s S_TG t/s T s S t/s
0 32 2 64 0.178 0.00 0.555 115.35 0.733 87.36
0 32 3 96 0.021 0.00 0.741 129.50 0.762 125.95
512 32 2 1088 0.860 1191.01 0.602 106.39 1.461 744.53
512 32 3 1632 1.270 1209.18 0.762 125.96 2.032 802.97
4096 32 2 8256 7.317 1119.61 0.611 104.67 7.928 1041.33
4096 32 3 12384 11.086 1108.44 0.869 110.43 11.955 1035.86
8192 32 2 16448 16.128 1015.88 0.670 95.49 16.798 979.16
8192 32 3 24672 24.585 999.62 1.004 95.61 25.590 964.14

@github-actions github-actions bot added ggml changes relating to the ggml tensor library for machine learning Apple Metal https://en.wikipedia.org/wiki/Metal_(API) labels May 13, 2025
@ggerganov
Copy link
Member Author

ggerganov commented May 13, 2025

@JohannesGaessler I am wondering if we can apply the same optimization to the CUDA FA kernels. I don't have currently a suitable CUDA machine that I can use for development, so just sharing some thoughts if you have some time to try out.

The idea is that the FA-vec kernel can directly skip blocks for which the KQ mask is full of -INF. For single-sequence generations this does not happen so we won't see any improvement, but it should improve a lot multi-sequence generations and the upcoming SWA models.

I took a quick look at the code and have the following suggestion:

diff --git a/ggml/src/ggml-cuda/fattn-vec-f16.cuh b/ggml/src/ggml-cuda/fattn-vec-f16.cuh
index d96e39212..171652dc8 100644
--- a/ggml/src/ggml-cuda/fattn-vec-f16.cuh
+++ b/ggml/src/ggml-cuda/fattn-vec-f16.cuh
@@ -175,6 +175,22 @@ static __global__ void flash_attn_vec_ext_f16(
     for (int k_VKQ_0 = blockIdx.y*D; k_VKQ_0 < ne11; k_VKQ_0 += gridDim.y*D) {
         // Calculate KQ tile and keep track of new maximum KQ values:
 
+        // mask -INF blocks
+        half mask_max = -INF_HALF;
+#pragma unroll
+        for (int j = 0; j < ncols; ++j) {
+            for (int i_KQ_0 = 0; i_KQ_0 < D; i_KQ_0 += nwarps) {
+                const int i_KQ = i_KQ_0 + threadIdx.y;
+                // reuse the array for masking -INF blocks
+                mask_max = max(mask_max, maskh[j*ne11 + k_VKQ_0 + i_KQ]);
+            }
+        }
+
+        mask_max = warp_reduce_max(mask_max); // TODO: not sure what it the function, but simply warp-reduce the var
+        if (mask_max == -INF_HALF) {
+            continue;
+        }
+
         // For unknown reasons using a half array of size 1 for kqmax_new causes a performance regression,
         // see https://github.com/ggerganov/llama.cpp/pull/7061 .
         // Therefore this variable is defined twice but only used once (so that the compiler can optimize out the unused variable).

It's not tested at all, but I hope mainly to illustrate the idea. There might be a better way to do it.

Edit: Btw, it might be worth trying to force using the ncols == 1 kernel because otherwise the check will never trigger for this particular test. Probably like this if I am following the logic correctly:

diff --git a/ggml/src/ggml-cuda/fattn-vec-f16.cuh b/ggml/src/ggml-cuda/fattn-vec-f16.cuh
index d96e39212..3541b63ec 100644
--- a/ggml/src/ggml-cuda/fattn-vec-f16.cuh
+++ b/ggml/src/ggml-cuda/fattn-vec-f16.cuh
@@ -335,7 +351,7 @@ void ggml_cuda_flash_attn_ext_vec_f16_case(ggml_backend_cuda_context & ctx, ggml
     float logit_softcap;
     memcpy(&logit_softcap, (const float *) KQV->op_params + 2, sizeof(float));
 
-    if (Q->ne[1] == 1) {
+    if (true) {
         constexpr int cols_per_block = 1;
         if (logit_softcap == 0.0f) {
             constexpr bool use_logit_softcap = false;

The same logic should be possible to apply to the non-vec FA kernel, but this is less important for now. If you could give this a try and run the llama-batched-bench command from this PR (make sure to have the llama-batched-bench changes from this branch), would appreciate. I am interested in the S_TG numbers before and after this change. Thanks.

@JohannesGaessler
Copy link
Collaborator

For token generation the bottleneck is I/O. Tensor cores work on matrix fragments of at least 8 rows/columns. Without GQA for <= 8 tokens each value from the KV cache is being loaded only once. So for <= 8 parallel sequences you can basically never skip any KV slices unless you use an implementation without tensor cores (the vec kernels do not). The vec kernels only work for a single token, for >1 tokens they would need to use KV values for >1 Q values and you would quickly run into the same problem where each KV value is basically always used for at least one sequence and you can never skip it completely. At most you can skip some of the compute but as long as you're I/O bound all of the compute pipelines are going to be underutilized anyways.

However, if a model is using GQA then you can get a much higher arithmetic intensity for each KV value since you can apply it to multiple Q values for each sequence. In such a scenario an optimization for potentially skipping a KV slice could work because you're much less I/O bound. As it is the CUDA kernels entangle up to 64 Q columns though so for this optimization to work correctly the selection logic for the number of parallel columns needs to know that the tokens belong to different sequences.

So the bottom line is that for CUDA this optimization will only make sense for at least 8 / (Q->ne[2] / K->ne[2]) parallel sequences and only if an op param is set to indicate that there will be little to no overlap between sequences.

I don't have currently a suitable CUDA machine that I can use for development, so just sharing some thoughts if you have some time to try out.

I have a machine with 6 RTX 4090s in my basement, I can give you access if it would be useful for your work.

@JohannesGaessler
Copy link
Collaborator

Just so I don't misunderstanding anything: multiple parallel sequences result in an increase in Q->ne[1], correct?

Also I forgot: for SWA specifically modifying the kernels to skip fully masked-out KV slices would work without issue.

@JohannesGaessler
Copy link
Collaborator

The CUDA kernel in fattn-mma-f16.cuh is using a stream-k decomposition where it's already working on fractional tiles. If the sequences had contiguous KV cache slices you could in principle also just adjust the lower and upper bounds of those fractional tiles.

@JohannesGaessler
Copy link
Collaborator

Edit: Btw, it might be worth trying to force using the ncols == 1 kernel because otherwise the check will never trigger for this particular test. Probably like this if I am following the logic correctly:

I should have probably been more explicit about this: you do not gain any advantage in terms of I/O if you make this change. At most you will gain an advantage in terms of compute. The vec kernels would first need to receive the same GQA optimizations that the kernel in fattn-mma-f16.cuh has.

@ggerganov
Copy link
Member Author

Just so I don't misunderstanding anything: multiple parallel sequences result in an increase in Q->ne[1], correct?

Generally yes, but there is also another use case. For example in #10860, using the provided Python runner script in sequential mode, would have Q->ne[1] == 1 during the text generation phase, but the different requests will be processed as different sequences (because the server has -np 15). So the KV cache will be very large after processing a few sequences, but the batch size will remain 1. The optimization in this PR skips a large amount of the computation in the FA vec kernel and improves the total runtime by ~20% for the Metal backend.

The script there also allows to run in parallel mode - in that case yes, Q->ne[1] = n_parallel during text generation.

I have a machine with 6 RTX 4090s in my basement, I can give you access if it would be useful for your work.

No worries, I was looking for a quick test.

Just to understand, is the flash_attn_vec_ext_f16 kernel ever executed for Ampere cards? And if yes, for which batch sizes (i.e. values of Q->ne[1])?

@JohannesGaessler
Copy link
Collaborator

For Ampere that kernel should only be used for batch size 1 and if not using GQA, but only because the mma implementation turned out to be faster due to the GQA optimization.

@ggerganov
Copy link
Member Author

Ok thanks, I think I understand better now.

The CUDA kernel in fattn-mma-f16.cuh is using a stream-k decomposition where it's already working on fractional tiles. If the sequences had contiguous KV cache slices you could in principle also just adjust the lower and upper bounds of those fractional tiles.

Could you clarify what does fractional tiles mean in this case?

To give some additional context, the main drawback of the unified KV cache implementation is that it can end up attending to many blocks that are masked by the KQ mask (i.e. cross-sequence attention), which we basically discard. Even though we don't need it, we still perform the computation over such -INF tiles.

I am thinking of a way to "filter" the tiles that are already masked by the KQ mask so that we save some computation. I could be missing something (so feel free to ignore or correct me), but another idea is to run a quick kernel to create a list of the tiles that have at least one non-INF cell and then in the fattn-mma-f16.cuh kernel we would loop over this filtered list of tiles, instead of over all tiles.

I am not following fully the explanation about the I/O of the kernel, but my thinking is very simple - the work area of the kernel is roughly: N_KV x BS (ignoring details such as heads and reduction of the results). In certain use cases, large blocks of this work area are practically irrelevant because the KQ mask is -INF and it would eliminate their contributions. Can we somehow avoid these areas to reduce the surface of the work area?

Base automatically changed from gg/batched-bench-fix-pp to master May 13, 2025 15:01
@ggerganov ggerganov merged commit c252e0c into master May 13, 2025
46 checks passed
@ggerganov ggerganov deleted the gg/metal-fa-vec-mask-opt branch May 13, 2025 15:04
@JohannesGaessler
Copy link
Collaborator

Could you clarify what does fractional tiles mean in this case?

When I say "tile", I mean a tile of the output tensor. So VKQ in the case of FA.

The standard way to parallelize a matrix multiplication is to assign some output tile to a CUDA block, the CUDA block then iterates over the input matrices to calculate said tile. The accumulators are kept in registers and you can simply write out the output tile. The problem is that modern GPUs are becoming increasingly "wide", meaning that the number of streaming multiprocessors has become much larger than in previous generations. This causes "tail effects" where the last wave of CUDA blocks cannot fully utilize the hardware; in the worst case scenario with 1.01 waves you basically lose half the performance.

With a stream-k decomposition CUDA blocks can work on fractions of output tiles and as a consequence the start and end points in the continuous ne0 x ne1 x ne00 can be assigned much more evenly to SMs. In return it becomes necessary to run a "fixup" kernel afterwards. But the runtime of that kernel scales only with the number of SMs, not the size of the input matrices. For FA there is an additional complication from having to rescale the partial results based on the maximum KQ values but the basic principle is the same.

I am not following fully the explanation about the I/O of the kernel, but my thinking is very simple - the work area of the kernel is roughly: N_KV x BS (ignoring details such as heads and reduction of the results). In certain use cases, large blocks of this work area are practically irrelevant because the KQ mask is -INF and it would eliminate their contributions. Can we somehow avoid these areas to reduce the surface of the work area?

One of the most basic strategies for good GPU performance is to achieve a high arithmetic intensity: you want to do as much work as possible for each data value that you load from memory. So for MMQ and FA I'm trying to work with large output tiles because the larger the output tiles are the more work can be done per data loaded from memory. The dimension that is the most difficult to scale for language models is ne11 because for a single user who is generating tokens without speculative decoding it is always 1. If you instead have 8 users ne11 is 8 and each CUDA block can do 8 times the work per loaded KV value. But on the other hand 7/8 of the work is useless because it's being masked out. But on the other other hand the tensor compute pipeline is going to be severely underutilized anyways so it doesn't matter that most of the work is useless. If you were to instead split the KV cache per user you would end up with 8 CUDA blocks that load 1/8 of the data each. The total data loaded is going to be the same but you are not wasting any compute. But now your matrix is too narrow in the ne11 dimension to use tensor cores and you were not bottlenecked by compute in the first place. So I don't think it's going to make a large difference - if the t/s scaling with batch size was perfect there would not be any difference.

Basically the problem is that right now the CUDA code always tries to work on as many ne11 values as possible in parallel so any KV slice that is used by at least one sequence cannot be skipped unless the size of output tiles is reduced which carries a performance penalty. However, when a model is using GQA that can be used to get a higher arithmetic intensity at lower ne11 values. The mma implementation is already doing this but due to the use of tensor cores the granularity in ne11 is still high which causes different sequences to become entangled unless the GQA ratio is at least 8.

What I'll do is add the GQA optimization to the vec kernels. Then it should be possible to use them for GQA models on all GPUs without a performance penalty relative to the mma kernel. For the vec kernels it's always possible to evaluate sequences independently from one another so an optimization to skip masked-out KQ slices would make sense. If the indices of relevant KQ slices were to be precomputed it would of course be a bit faster. If all sequences are disjointed from one another then there would never be a benefit to scaling up the ne11 dimension of a CUDA block because the percentage of wasted work would exactly cancel out with the increase in arithmetic intensity. So those cases should be somehow marked (like with an op param) and for those cases the vec kernel should be used regardless of ne11.

@ggerganov
Copy link
Member Author

ggerganov commented May 13, 2025

Yes, this makes complete sense - thank you for the detailed answer. I think the reasoning is clear.

I would flag one detail that IMO would warrant some extra looking into - the conclusions about the performance of the FA vec kernel in #12014 about increasing the arithmetic intensity of the kernel is valid mainly for the speculative decoding use case where the small batch of tokens is from the same sequence. However, for small batches with mixed sequences it might not be the case. The llama-bench -ub 1,2,3,4... -p 16384 test is relevant for the same-sequence batch and this was the target to optimize. But if we take a look at the llama-batched-bench -npl 1,2,3,4... test (see OP) I suspect the results would be opposite.

So those cases should be somehow marked (like with an op param) and for those cases the vec kernel should be used regardless of ne11.

I agree that marking the batches would bring the optimal performance in the 2 scenarios. However, this would come with some extra complexity in the batching logic. Not saying it's not worth it, but my guess would be to first explore the idea of:

For the vec kernels it's always possible to evaluate sequences independently from one another so an optimization to skip masked-out KQ slices would make sense. If the indices of relevant KQ slices were to be precomputed it would of course be a bit faster. If all sequences are disjointed from one another then there would never be a benefit to scaling up the ne11 dimension of a CUDA block because the percentage of wasted work would exactly cancel out with the increase in arithmetic intensity.

This is very easy to implement in the FA-vec kernel, even without precomputing the indices (see the diff that I suggested earlier). And even if it hurts the speculative decoding case by ~5%, I think it might significantly improve the multi-sequence use case. So it would be a simple intermediate step before we decide if it is worth to mark the multi-sequence batches.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Apple Metal https://en.wikipedia.org/wiki/Metal_(API) ggml changes relating to the ggml tensor library for machine learning
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants